
Quick memory optimisations #776

Merged: 11 commits merged into dev from optimise_mem on Nov 4, 2024
Conversation

@mauch mauch (Contributor) commented Oct 23, 2024

These changes stop the dataframe from quadrupling in memory when using iterrows before upload to the database. There are also a few other minor changes that squeeze out a bit more memory at no cost to processing time.

It runs for me through forced extraction with ~2700 images, and through the measurement pairs with ~1000 images. The pairs dataframe just gets too big after ~1000 images.

These changes should be good to hold us over for a 1.x release (no changes to the database structure, etc.) until further improvements in the v2.x release, which won't be backwards compatible.
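For context, a minimal sketch of the iterrows to itertuples swap these changes rely on; the DataFrame and column names here are illustrative, not the pipeline's actual schema:

```python
import pandas as pd

df = pd.DataFrame({"source_id": range(100_000), "flux_peak": 1.0})

# iterrows materialises a full pandas Series for every row, which is where
# the extra in-memory copies of the dataframe come from.
rows = [(r["source_id"], r["flux_peak"]) for _, r in df.iterrows()]

# itertuples yields lightweight namedtuples instead, so building the upload
# batch no longer multiplies the frame's memory footprint.
rows = [(r.source_id, r.flux_peak) for r in df.itertuples(index=False)]
```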

@mauch mauch requested a review from ddobie October 23, 2024 03:06
@mauch mauch added the enhancement New feature or request label Oct 23, 2024
CHANGELOG.md Outdated
@@ -24,6 +25,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),

#### List of PRs

- [#776](https://github.com/askap-vast/vast-pipeline/pull/776): Minor memory optimisations

Suggested change
- [#776](https://github.com/askap-vast/vast-pipeline/pull/776): Minor memory optimisations
- [#776](https://github.com/askap-vast/vast-pipeline/pull/776): fix: Minor memory optimisations

Although I realise that I didn't adhere to this formatting in the previously listed PR!!

vast_pipeline/pipeline/forced_extraction.py (review thread resolved)
vast_pipeline/pipeline/forced_extraction.py (review thread resolved)
vast_pipeline/utils/utils.py (review thread resolved)
@mauch mauch requested a review from ddobie November 4, 2024 00:09
@mauch mauch (Contributor, Author) commented Nov 4, 2024

I've added some comments where you suggested. Also I've modified part of the revalidation step to save ~5GB of RAM for runs with ~2700 images.
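For illustration, a hypothetical sketch of the copy-before-revalidation idea mentioned in the commits (the helper name and config keys below are made up and are not the pipeline's actual config handling): revalidating a plain copy of the parsed data means the original parsed YAML tree is no longer needed and can be garbage collected.

```python
import copy

# Illustrative defaults only; not the pipeline's real configuration keys.
DEFAULTS = {"num_workers": None, "max_partition_mb": 15}


def revalidate(parsed_config: dict) -> dict:
    # Deep-copy first: once the caller drops its reference to the original
    # parsed YAML data, that (potentially large) object tree can be reclaimed
    # by the garbage collector instead of lingering for the whole run.
    config = copy.deepcopy(parsed_config)
    for key, value in DEFAULTS.items():
        config.setdefault(key, value)
    return config
```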

@ddobie ddobie (Contributor) left a comment

Looks good to me. I haven't run it myself, but you have and it passes tests so I'm happy to approve.

@mauch mauch merged commit 19883c2 into dev Nov 4, 2024
5 checks passed
@mauch mauch deleted the optimise_mem branch November 4, 2024 01:00
ddobie added a commit that referenced this pull request Nov 7, 2024
* Organise v1.1.1-dev

* Fix changelog formatting and update changelog instructions (#772)

* Initial changelog formatting issues

* Update changelog + instructions

* Updated changelog

* Updated Code of conduct (#773)

* Updated Code of conduct

* Updated changelog

* Fixed grammar

* Fix zenodo DOI

* Fixed typo in README

* Shorten forced fit measurement names (#734)

* Shorten names

* Updated changelog

* Update clearpiperun to use raw SQL (#775)

* timing and memory benchmark

* delete raw initial

* adding profiler

* optimisation handling exceptions

* Added logging

* Updated delete_run

* Fix syntax errors

* Disable triggers to see if that fixes speed issues

* Remove memory profiling

* Reenabled logging

* Add end of loop logging, remove tqdm

* Remove all tqdm, improve logging slightly

* Added timing

* Fixed tqdm missing

* Fix logging

* Added units to logging

* specify source id in logging

* Toggle triggers

* clean up clearpiperun

* Other minor updates

* Fix variable name

* Correctly handle images and skyregions that are associated with multiple runs

* PEP8

* Updated changelog

* Remove commented code

* Remove whitespace - don't know why the linter didn't pick this up

* Update vast_pipeline/management/commands/clearpiperun.py

Co-authored-by: Tom Mauch <[email protected]>

* Update vast_pipeline/utils/delete_run.py

Co-authored-by: Tom Mauch <[email protected]>

* Update vast_pipeline/utils/delete_run.py

Co-authored-by: Tom Mauch <[email protected]>

* Update vast_pipeline/utils/delete_run.py

Co-authored-by: Tom Mauch <[email protected]>

* Update vast_pipeline/utils/delete_run.py

Co-authored-by: Tom Mauch <[email protected]>

* Update vast_pipeline/utils/delete_run.py

Co-authored-by: Tom Mauch <[email protected]>

* Update vast_pipeline/utils/delete_run.py

Co-authored-by: Tom Mauch <[email protected]>

* Update vast_pipeline/management/commands/clearpiperun.py

Co-authored-by: Tom Mauch <[email protected]>

* Update vast_pipeline/management/commands/clearpiperun.py

Co-authored-by: Tom Mauch <[email protected]>

* Update vast_pipeline/management/commands/clearpiperun.py

Co-authored-by: Tom Mauch <[email protected]>

* Update vast_pipeline/utils/delete_run.py

Co-authored-by: Tom Mauch <[email protected]>

* Update vast_pipeline/utils/delete_run.py

Co-authored-by: Tom Mauch <[email protected]>

* Update vast_pipeline/utils/delete_run.py

Co-authored-by: Tom Mauch <[email protected]>

* Update vast_pipeline/utils/delete_run.py

Co-authored-by: Tom Mauch <[email protected]>

* Update vast_pipeline/utils/delete_run.py

Co-authored-by: Tom Mauch <[email protected]>

* Update vast_pipeline/utils/delete_run.py

Co-authored-by: Tom Mauch <[email protected]>

* Fix logging count

* Clean up logging statements

---------

Co-authored-by: Shibli Saleheen <[email protected]>
Co-authored-by: Tom Mauch <[email protected]>

* Quick memory optimisations (#776)

* Use itertuples over iterrows since iterrows is an enormous memory hog.

* Drop sources_df columns before renaming the id column to avoid a copy of the whole dataframe in memory (see the sketch below this commit list).

* Decrease default partition size to 15MB

* Don't split the (large-in-memory) list of DataFrames into dask bags (no performance hit).

* Don't write forced parquets in parallel (no performance hit for this).

* Don't overwrite the input DataFrame when writing parquets.

* Update CHANGELOG.md

* Address review comments.

* Copy YAML objects before revalidation so they can be garbage collected.

* Appease flake8

* 750 configure workers (#777)

* Use itertuples over iterrows since iterrows is an enormous memory hog.

* Drop sources_df columns before renaming the id column to avoid a copy of the whole dataframe in memory.

* Decrease default partition size to 15MB

* Don't split the (large-in-memory) list of DataFrames into dask bags (no performance hit).

* Don't write forced parquets in parallel (no performance hit for this).

* Initial configuration updates for processing options.

* Don't overwrite the input DataFrame when writing parquets.

* Update CHANGELOG.md

* Address review comments.

* Copy YAML objects before revalidation so they can be garbage collected.

* Appease flake8

* Add processing options as optional with defaults.

* Filter processing config to parallel association.

* Add a function to determine the number of workers and partitions for Dask (see the sketch below this commit list).

* Use config values for num_workers and max_partition_size throughout pipeline.

* Correct wording in config template.

* Update CHANGELOG.md

* Remove unused imports.

* Bump strictyaml to 1.6.2

* Use YAML 'null' to create Python None for all cores option.

* Make None the default in `calculate_workers_and_partitions` instead of 0

* Updated run config docs

* Allow null for num_workers_io and improve validation of processing parameters.

* Update num_workers_io default in docs.

---------

Co-authored-by: Dougal Dobie <[email protected]>

* Prepare v1.2.0 release

---------

Co-authored-by: Shibli Saleheen <[email protected]>
Co-authored-by: Tom Mauch <[email protected]>
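As a footnote on the "drop columns before renaming" commit in the squash above, a minimal illustration of the pattern (the column names are made up): DataFrame.rename typically returns a copy of the frame (depending on pandas version and copy-on-write settings), so dropping unneeded columns first keeps that copy small.

```python
import pandas as pd

sources_df = pd.DataFrame({
    "id": range(5), "flux_peak": 1.0, "flux_int": 2.0, "scratch_col": 0.0,
})

# Renaming first would copy every column, including ones about to be dropped;
# dropping first means the subsequent rename only copies what is kept.
sources_df = sources_df.drop(columns=["scratch_col"])
sources_df = sources_df.rename(columns={"id": "source_id"})
```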
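And a purely illustrative sketch of the worker/partition calculation added in #777: `calculate_workers_and_partitions` is named in the commits above, but the signature and arithmetic below are assumptions, not the pipeline's actual implementation. `num_workers: null` in the YAML maps to Python None, which is taken to mean "use all cores", and the partition count follows from the data size and the maximum partition size.

```python
import math
import os
from typing import Optional, Tuple


def calculate_workers_and_partitions(data_size_bytes: int,
                                     num_workers: Optional[int] = None,
                                     max_partition_mb: int = 15) -> Tuple[int, int]:
    # None (YAML 'null') means "use every available core".
    workers = num_workers if num_workers is not None else os.cpu_count()
    # Enough partitions that none exceeds max_partition_mb, with at least
    # one partition per worker.
    npartitions = max(workers,
                      math.ceil(data_size_bytes / (max_partition_mb * 1024 ** 2)))
    return workers, npartitions


# e.g. roughly 1 GiB of data on an 8-core machine prints (8, 69)
print(calculate_workers_and_partitions(1024 ** 3))
```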